The purpose of this lab is to introduce linear regression using base R and the tidyverse. We work on a dataset provided by the MASS package. This dataset is investigated in the book by Venables and Ripley, whose discussion is worth reading. Our aim is to relate regression as a tool for data exploration to regression as a method in statistical inference. To perform regression, we will rely on the base R function lm() and on the eponymous S3 class lm. We will spend time understanding how the formula argument can be used to construct a design matrix from a dataframe representing a dataset.
Packages installation and loading (again)
Code
# We will use the following packages.
# If needed, install them: pak::pkg_install().
stopifnot(
  require("magrittr"),
  require("lobstr"),
  require("ggforce"),
  require("patchwork"),
  require("gt"),
  require("glue"),
  require("skimr"),
  require("corrr"),
  require("GGally"),
  require("broom"),
  require("tidyverse")
)
Besides the tidyverse, we rely on skimr to perform univariate analysis and on GGally::ggpairs to perform pairwise (bivariate) analysis. Package corrr provides graphical tools to explore correlation matrices. At some point, we will showcase the exposing pipe %$% and the classical pipe %>% of magrittr. We use gt to display handy tables and patchwork to compose graphical objects. glue provides a kind of formatted strings. Package broom proves very useful when milking linear models produced by lm() (and many other objects produced by estimators, tests, …)
Dataset
The dataset is available from package MASS. MASS can be downloaded from cran.
Code
whiteside <- MASS::whiteside  # no need to load the whole package
cur_dataset <- str_to_title(as.character(substitute(whiteside)))
# ?whiteside
The documentation of R tells us a little bit more about this data set.
Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption.
This means that our sample is made of 56 observations. Each observation corresponds to a week during a heating season. For each observation, we have the average external temperature Temp (in degrees Celsius) and the weekly gas consumption Gas. We also have Insul, which tells us whether the observation was recorded Before or After treatment.
Temperature is the explanatory variable, or the covariate. The target/response is the weekly gas consumption. We aim to predict or to explain the variations of weekly gas consumption as a function of average weekly temperature.
The question is whether the treatment (insulation) modifies the relation between gas consumption and external temperature, and if we conclude that the treatment modifies the relation, in which way.
Even though the experimenter, Mr Whiteside, decided to apply a treatment to his house, this is not exactly what we call experimental data: the experimenter has no way to clamp the external temperature. With respect to the Temp variable (the explanatory variable), we are facing observational data.
Columnwise exploration
Question
Before proceeding to linear regressions of Gas with respect to Temp, perform univariate analysis on each variable.
Compute summary statistics
Build the corresponding plots
Solution
skimr does the job. There are no missing data (the complete rate is always 1), so we remove the non-informative columns from the output.
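A minimal sketch of such a call (the columns dropped from skim()'s output are a choice made for this dataset):

```r
library(skimr)
library(dplyr)

whiteside <- MASS::whiteside

whiteside |>
  skim() |>
  # no missing data here, so these columns carry no information
  select(-n_missing, -complete_rate)
```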
Whiteside dataset

| Var  | statistic | p.value | parameter | method           |
|------|-----------|---------|-----------|------------------|
| Temp | 1.37      | 0.50    | 2.00      | Jarque Bera Test |
| Gas  | 2.74      | 0.25    | 2.00      | Jarque Bera Test |
Both variables also pass the Jarque-Bera test with flying colors. This is noteworthy since the Jarque-Bera test compares the empirical skewness and kurtosis to their Gaussian counterparts.
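The test results above can be reproduced with jarque.bera.test() from package tseries (an assumption: the table may have been produced with another implementation of the test):

```r
library(tseries)  # provides jarque.bera.test()

whiteside <- MASS::whiteside

# test statistic compares empirical skewness/kurtosis to Gaussian values
jarque.bera.test(whiteside$Temp)
jarque.bera.test(whiteside$Gas)
```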
Pairwise exploration
Question
Compare distributions of numeric variables with respect to categorical variable Insul
Solution
We start by plotting histograms
To abide to the DRY principle, we take advantage of the fact that aesthetics and labels can be tuned incrementally.
Code
p_xx <- whiteside |>
  ggplot() +
  geom_histogram(
    mapping = aes(fill = Insul, color = Insul, y = after_stat(density)),
    position = "dodge",
    alpha = .1,
    bins = 10
  )

p_1 <- p_xx +
  aes(x = Temp) +
  theme(legend.position = "None") +
  labs(title = "External Temperature", subtitle = "Weekly Average (Celsius)")

p_2 <- p_xx +
  aes(x = Gas) +
  labs(title = "Gas Consumption", subtitle = "Weekly Average")

# patchwork the two graphical objects
(p_1 + p_2) +
  patchwork::plot_annotation(caption = "Whiteside dataset from package MASS")
Note the position parameter for geom_histogram(). Check the result for position="stack", position="identity". Make up your mind about the most convenient choice.
The distribution of Gas consumption After looks shifted with respect to the distribution Before.
Another visualization strategy consists in using the faceting mechanism
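A sketch of the faceted alternative, using facet_wrap() on Insul (the exact aesthetics chosen here are an assumption):

```r
library(ggplot2)

whiteside <- MASS::whiteside

whiteside |>
  ggplot() +
  aes(x = Gas, y = after_stat(density)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ Insul) +   # one panel per heating season
  labs(title = "Gas Consumption", subtitle = "Weekly Average")
```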
# Compute the covariance matrix for Gas and Temp
C <- whiteside |>
  select(where(is.numeric)) |>
  cov()
Computing correlations and correlations per groups is revealing.
Code
# use magrittr pipe %>% to define a pipeline function
my_cor <- . %>%
  summarize(
    pearson = cor(Temp, Gas, method = "pearson"),
    kendall = cor(Temp, Gas, method = "kendall"),
    spearman = cor(Temp, Gas, method = "spearman")
  )

t1 <- whiteside |> group_by(Insul) |> my_cor()
t2 <- whiteside |> my_cor() |> mutate(Insul = "pooled")

bind_rows(t1, t2) |>
  gt() |>
  gt::fmt_number(decimals = 2) |>
  gt::tab_caption("Whiteside data: correlations between Gas and Temp")
Whiteside data: correlations between Gas and Temp

| Insul  | pearson | kendall | spearman |
|--------|---------|---------|----------|
| Before | −0.97   | −0.85   | −0.96    |
| After  | −0.90   | −0.72   | −0.86    |
| pooled | −0.68   | −0.47   | −0.62    |
Note the sharp increase (in absolute value) of all correlation coefficients when data are grouped according to the control/treatment variable (Insul).
Use ggpairs from GGally to get a quick overview of the pairwise interactions.
Code
whiteside |> GGally::ggpairs()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Question
Build a scatterplot of the Whiteside dataset
Solution
Code
p <- whiteside %>%
  ggplot() +
  aes(x = Temp, y = Gas) +
  geom_point(aes(shape = Insul)) +
  xlab("Average Weekly Temperature (Celsius)") +
  ylab("Average Weekly Gas Consumption (1000 cubic feet)") +
  labs(
    ## Use list unpacking
  )

p + ggtitle(glue("{cur_dataset} dataset"))
Note that the dataset looks like the stacking of two bananas corresponding to the two heating seasons.
In simple linear regression, the slope follows from the covariance matrix in a straightforward way. The slope can also be expressed as the linear correlation coefficient (Pearson) times the ratio between the standard deviation of the response variable and the standard deviation of the explanatory variable.
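As a sanity check, both expressions can be computed directly (this is a sketch added for illustration, not part of the original lab):

```r
whiteside <- MASS::whiteside

# slope = cov(x, y) / var(x)
slope_cov <- cov(whiteside$Temp, whiteside$Gas) / var(whiteside$Temp)

# slope = cor(x, y) * sd(y) / sd(x)
slope_cor <- cor(whiteside$Temp, whiteside$Gas) *
  sd(whiteside$Gas) / sd(whiteside$Temp)

# both agree with coef(lm(Gas ~ Temp, data = whiteside))["Temp"], about -0.29
c(slope_cov, slope_cor)
```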
lm stands for Linear Models. Function lm has a number of arguments, including:
formula
data
Question
Use lm() to compute slope and intercept
Solution
Code
lm0 <-lm(Gas ~ Temp, data = whiteside)
The result is an object of class lm.
The generic function summary() has a method for class lm
Code
lm0 %>%summary()
Call:
lm(formula = Gas ~ Temp, data = whiteside)
Residuals:
Min 1Q Median 3Q Max
-1.6324 -0.7119 -0.2047 0.8187 1.5327
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4862 0.2357 23.275 < 2e-16 ***
Temp -0.2902 0.0422 -6.876 6.55e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8606 on 54 degrees of freedom
Multiple R-squared: 0.4668, Adjusted R-squared: 0.457
F-statistic: 47.28 on 1 and 54 DF, p-value: 6.545e-09
The summary is made of several parts:
The call. Very useful if we handle many different models (corresponding to different formulae, or different datasets)
A numerical summary of residuals
A commented display of the estimated coefficients
Estimate of noise scale (under Gaussian assumptions)
Squared linear correlation coefficient between response variable \(Y\) (Gas) and predictions \(\widehat{Y}\)
A test statistic (Fisher’s statistic) for assessing null hypothesis that slope is null, and corresponding \(p\)-value (under Gaussian assumptions).
Including a raw summary in a report is not always a good idea. It is easy to extract tabular versions of the summary using functions tidy() and glance() from package broom.
For html output gt::gt() allows us to polish the final output
Solution
We can use the exposing pipe from magrittr (or the with construct from base R) to build a function that extracts the coefficients estimates, standard error, \(t\)-statistic and associated p-values.
Code
tidy_lm <- . %$% (           # <1> The lhs is meant to be of class lm
  tidy(.) %>%                # <2> . acts as a pronoun for magrittr pipes
    gt::gt() %>%
    gt::fmt_number(decimals = 2) %>%
    gt::tab_caption(glue(
      "Linear regression. Dataset: {call$data}, Formula: {deparse(call$formula)}"
    ))                       # <3> call is evaluated as a member of the pronoun `.`
)

tidy_lm(lm0)
Linear regression. Dataset: whiteside, Formula: Gas ~ Temp

| term        | estimate | std.error | statistic | p.value |
|-------------|----------|-----------|-----------|---------|
| (Intercept) | 5.49     | 0.24      | 23.28     | 0.00    |
| Temp        | −0.29    | 0.04      | −6.88     | 0.00    |
deparse() is an important function from base R. It is very helpful when trying to take advantage of lazy evaluation mechanisms.
TODO: more on glue()
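In the meantime, a minimal illustration of glue(): expressions inside braces are evaluated in the calling environment, or against values supplied as named arguments.

```r
library(glue)

n <- 56
dataset <- "whiteside"

glue("Dataset {dataset} has {n} observations")
# Dataset whiteside has 56 observations

# interpolation can also draw from named arguments:
glue("{x} squared is {x^2}", x = 7)
# 7 squared is 49
```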
Question
Function glance() extracts information that can be helpful when performing model/variable selection.
Solution
The next chunk handles several other parts of the summary.
sigma estimates the noise standard deviation (under the homoscedastic Gaussian noise assumption)
statistic is the Fisher statistic used to assess the hypothesis that the slope (the Temp coefficient) is zero. It is compared with quantiles of the Fisher distribution with 1 and 54 degrees of freedom (check pf(47.28, df1=1, df2=54, lower.tail=F) or qf(6.55e-09, df1=1, df2=54, lower.tail=F)).
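A sketch of the corresponding call (the columns selected here are a subset of glance()'s output):

```r
library(broom)
library(dplyr)

lm0 <- lm(Gas ~ Temp, data = MASS::whiteside)

# one-row tibble: goodness of fit and test statistics for the whole model
glance(lm0) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual)
```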
Question
R offers a function confint() that can be fed with objects of class lm. Explain the output of this function.
Solution
Under the Gaussian homoscedastic noise assumption, confint() produces confidence intervals for estimated coefficients. Using the union bound, we can derive a conservative confidence rectangle.
Code
M <- confint(lm0, level = .95)

as_tibble(M) |>
  mutate(Coeff = rownames(M)) |>
  relocate(Coeff) |>
  gt::gt() |>
  gt::fmt_number(decimals = 2)
| Coeff       | 2.5 % | 97.5 % |
|-------------|-------|--------|
| (Intercept) | 5.01  | 5.96   |
| Temp        | −0.37 | −0.21  |
Question
Plot a \(95\%\) confidence region for the parameters (assuming homoscedastic Gaussian noise).
Solution
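One possible sketch, under the stated Gaussian assumption: the joint \(95\%\) confidence region for (intercept, slope) is the ellipse \(\{\beta : (\beta-\widehat{\beta})^\top V^{-1} (\beta-\widehat{\beta}) \leq 2 F\}\), where \(V\) is vcov(lm0) and \(F\) is the .95 quantile of the Fisher distribution with 2 and 54 degrees of freedom. The plotting choices below are illustrative, not the lab's official solution.

```r
library(ggplot2)

lm0 <- lm(Gas ~ Temp, data = MASS::whiteside)

b <- coef(lm0)
V <- vcov(lm0)  # estimated covariance matrix of the coefficient estimates
r <- sqrt(2 * qf(.95, df1 = 2, df2 = lm0$df.residual))

# map the unit circle through a square root of V, rescale, recenter at b
e     <- eigen(V)
theta <- seq(0, 2 * pi, length.out = 200)
circ  <- rbind(cos(theta), sin(theta))
ell   <- t(b + r * e$vectors %*% diag(sqrt(e$values)) %*% circ)
colnames(ell) <- c("Intercept", "Slope")

ggplot(as.data.frame(ell)) +
  aes(x = Intercept, y = Slope) +
  geom_path() +
  annotate("point", x = b[1], y = b[2]) +
  labs(title = "95% confidence region for (intercept, slope)")
```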
Diagnostic plots
Method plot.lm() of generic S3 function plot from base R offers six diagnostic plots. By default it displays four of them.
Question
What are the diagnostic plots good for?
Solution
Code
plot(lm0)
The motivation and usage of diagnostic plots is explained in detail in the book by Fox and Weisberg: An R companion to applied regression.
The diagnostic plots can be built from the information gathered in the lm object returned by lm(...).
It is convenient to extract the required pieces of information using method augment.lm of generic function augment() from package broom.
(diag_1 + diag_2 + diag_3 + guide_area()) +
  plot_layout(guides = "collect") +
  plot_annotation(
    title = glue("{cur_dataset} dataset"),
    subtitle = glue("Regression diagnostic {deparse(lm0$call$formula)}"),
    caption = 'The fact that the sign of residuals depends on Insul shows that our modeling is too naive.\nThe qqplot suggests that the residuals are not collected from Gaussian homoscedastic noise.'
  )
Taking into account Insulation
Design a formula that allows us to take into account the possible impact of Insulation. Insulation may impact the relation between weekly Gas consumption and average external Temperature in two ways. Insulation may modify the Intercept, it may also modify the slope, that is the sensitivity of Gas consumption with respect to average external Temperature.
Have a look at formula documentation (?formula).
Solution
Code
lm1 <-lm(Gas ~ Temp * Insul, data = whiteside)
Check the design using function model.matrix(). How can you relate this augmented design and the one-hot encoding of variable Insul?
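A quick way to inspect the augmented design (since Before is the baseline level of Insul, its one-hot encoding appears as the InsulAfter column, together with the interaction Temp:InsulAfter):

```r
lm1 <- lm(Gas ~ Temp * Insul, data = MASS::whiteside)

# columns: (Intercept), Temp, InsulAfter, Temp:InsulAfter
head(model.matrix(lm1))
```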
Column .sigma contains the leave-one-out estimates of \(\sigma\), that is, whiteside_aug1$.sigma[i] is the estimate of \(\sigma\) you obtain by leaving out the i-th row of the dataframe.
There is no need to recompute everything for each sample element.
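The claim about .sigma can be checked directly (a sketch; whiteside_aug1 denotes the augmented dataframe discussed above):

```r
library(broom)

lm1 <- lm(Gas ~ Temp * Insul, data = MASS::whiteside)
whiteside_aug1 <- augment(lm1)

# .sigma[i] equals the residual standard error of the fit without row i
i <- 1
lm_loo <- lm(Gas ~ Temp * Insul, data = MASS::whiteside[-i, ])
c(whiteside_aug1$.sigma[i], summary(lm_loo)$sigma)  # the two values match
```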